Abstract

The assumption of maximum parallelism support for the 
successful realization of scalable quantum computers has 
led to homogeneous, "sea-of-qubits" architectures. The re- 
sulting architectures overcome the primary challenges of re- 
liability and scalability at the cost of physically unaccept- 
able system area. We find that by exploiting the natural se- 
rialization at both the application and the physical microar- 
chitecture level of a quantum computer, we can reduce the 
area requirement while improving performance. In particu- 
lar we present a scalable quantum architecture design that 
employs specialization of the system into memory and com- 
putational regions, each individually optimized to match 
hardware support to the available parallelism. Through 
careful application and system analysis, we find that our 
new architecture can yield up to a factor of thirteen savings 
in area due to specialization. In addition, by providing a 
memory hierarchy design for quantum computers, we can 
increase time performance by a factor of eight. This result 
brings us closer to the realization of a quantum processor 
that can solve meaningful problems. 

1 Introduction 

Conventional architectural design adheres to the concept 
of balance. For example, the register file depth is matched 
to the number of functional units, the memory bandwidth to 
the cache miss rate, or the interconnect bandwidth matched 
to the compute power of each element of a multiprocessor. 
We apply this concept to the design of a quantum com- 
puter and introduce the Compressed Quantum Logic Array 
(CQLA), an architecture that balances components and re- 
sources in terms of exploitable parallelism. The primary 
goal of our design is to address the problem of large area, 
approximately 1 m^2 on a side, of our previous design [1].



Specifically, we discover that the prevailing approach to 
designing a quantum computer, that of supporting maxi- 
mal parallelism, is area inefficient. We also find that ex- 
ploitable parallelism is inherently limited by both resource 
constraints and application structure. This lack of paral- 
lelism gives us the freedom to increase density by special- 
izing components as blocks of memory and blocks of com- 
putation. 

We introduce the idea of periodically reducing our in- 
vestment in reliability and thereby increasing speed. By 
encoding the compute regions differently than memory we 
provide very fast compute regions, while allowing the mem- 
ory to be slower and more reliable. To ensure that the faster 
compute region does not suffer from too many stalls, we 
employ a quantum memory hierarchy wherein the cache uti- 
lizes the same encoding mechanism as the compute region. 
When making this effort to improve speed, it is critical that 
overall system fidelity is maintained. We show how this can 
be accomplished. 

Due to the quantum no-cloning theorem [2], it is neces-
sary for all quantum data to physically move from source to
destination. We cannot create a copy of the data and send 
the copy. Our architecture focuses on implementation with 
an array of trapped atomic ions, one of the most mature and 
scalable technologies that provides a wealth of experimen- 
tal data. In ion-traps, the physical representation of data are 
ions that are in constant motion, on a two dimensional grid, 
throughout the computation. Since this physical movement 
is slow, yet unavoidable, it limits available parallelism at the 
microarchitecture level. 

At the application level, we find that only a limited 
amount of parallelism can be extracted from key quantum 
algorithms. This means that we may only need a few com- 
pute blocks for all the qubits in memory. This is in contrast 
to the popular "sea of qubits" model which allows computation
at every qubit. Our results show up to a 13X increase
in density, particularly important in addressing our primary
goal, and a speedup of about 8. The large area improvement 
brings the engineering of a quantum architecture closer to 
the capabilities of current implementation technologies. 

The choice of quantum error correction codes (ECC) influences
our results and the architecture. In our specialized
architecture analysis, we use the previously considered
Steane [[7,1,3]] code [3] and utilize a newly optimized
Bacon-Shor [[9,1,3]] code [4, 5, 6]. The [[9,1,3]] code, though
larger than the [[7,1,3]] code since it uses more physical
qubits to encode a single logical qubit, requires far fewer
resources for error correction [6], thus reducing the overall
area and increasing the speed.

Furthermore, we find that communication is generally 
dominated by computation for error correction. This com- 
putation allows us to absorb the cost of moving data be- 
tween different regions of the architecture. Error correction 
is so substantial, in fact, that quantum computers do not suf- 
fer from the memory wall faced by conventional computers. 
Thus our dense structure with a communication infrastructure
based on our prior work [7] can accommodate applications
with highly-demanding communication patterns.

In summary, the contributions of this work are: 1) Our 
specialized architecture, the CQLA, successfully tackles the 
issue of size, which has been the biggest drawback facing 
large-scale realizable quantum computers. 2) We show that 
current parallelism in quantum algorithms is inherently lim- 
ited and consideration of physical resources and data move- 
ment restrict it even further. 3) We present and analyze the 
abstractions of memory, cache and computation units for a 
quantum computer; based on the insight that we can reduce 
reliability for the compute units and cache without sacri- 
ficing overall computation fidelity. This approach helps us 
significantly increase the performance of the system. 

The paper is organized as follows. Section 2 provides
a background of the homogeneous QLA architecture and
the low-level microarchitecture assumptions of our system.
Section 3 motivates the specialized CQLA architecture and
introduces the architectural abstractions. Thereafter we
discuss how the Steane and Bacon-Shor error correction codes
affect the design of the CQLA. Results and analysis of our
abstractions are the focus of Section 5, following which we
provide details of computation versus communication
requirements of the most widely accepted quantum
applications. We end with future directions in Section 7 and our
conclusions in Section 8.

2 Background 

Our architectural model is built upon our previous work 
on the Quantum Logic Array (QLA) architecture [1]. The



QLA architecture is a hierarchical array-based design that 
overcomes the primary challenges of scalability for large- 
scale quantum architectures. It is a homogeneous, tiled ar- 
chitecture with three main components: logical qubits im- 
plemented as self-contained computational tiles structured 
for quantum error correction; trapped atomic ions
as the underlying technology; finally, teleportation-based 
communication channels utilizing the concept of quantum 
repeaters to overcome the long-distance communication 
constraints. 

2.1 The Logical Qubit 

The basic structure of the QLA, our prior work, implements
a fault-tolerant quantum bit, or a logical qubit, as
a self-contained tile whose underlying construction is
intended for quantum error correction, by far the most
dominant and basic operation in a quantum machine [1].
Quantum error correction is expensive because arbitrary
reliability is achieved by recursively encoding physical qubits
at the cost of exponential overhead. Recursive error
correction works by encoding N physical ion-qubits into a known
highly-correlated state that can be used to represent a single
logical data qubit. This data qubit is now at level 1
recursion and may have the property of being in a superposition
of "0" and "1" much like a single physical qubit. Encoding
once more we can create a logical qubit at level 2 recursion
with N^2 physical ion-qubits. With each level, L, of encoding
the probability of failure of the system scales as p_0^(2^L), where
p_0 is the failure rate of the individual physical components
given a fault-tolerant arrangement and sequence of opera- 
tions for the lower level components. The ability to apply 
logical operations on a logical qubit without the need to de- 
code and subsequently re-encode the data is key to the ex- 
istence of fault-tolerant quantum microarchitecture design, 
where arbitrary reliability can be efficiently reached through 
recursive encoding. 
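
The overhead and reliability scaling described in this paragraph can be tabulated in a few lines. The following is a minimal sketch rather than the model used in the paper: the function names are hypothetical, n = 7 is the block size of the Steane code, the value p0 = 1e-3 is an arbitrary illustration, and the failure estimate uses only the p_0^(2^L) scaling quoted above, omitting the threshold constant that a full analysis would include.

```python
def physical_qubits(n: int, level: int) -> int:
    """Physical data qubits in one logical qubit after `level` rounds of encoding."""
    return n ** level


def failure_scaling(p0: float, level: int) -> float:
    """Scaling of the failure probability, p0^(2^level), constants omitted."""
    return p0 ** (2 ** level)


if __name__ == "__main__":
    # Illustrative values only: n = 7 (Steane block size), p0 = 1e-3.
    for level in range(3):
        print(level, physical_qubits(7, level), failure_scaling(1e-3, level))
```

The qubit count grows exponentially in the level while the failure exponent doubles with each level, which is the trade-off the text describes.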

The logical qubits in the QLA are arranged in a regular 
array fashion, connected with a tightly integrated repeater-
based [8] interconnect. This makes the high-level design of
the QLA very similar to classical tile based architectures. 
The key difference is that the communication paths must 
account for data errors in addition to latency. Integrated re- 
peaters known as teleportation islands redirect qubit traffic 
in the 4 cardinal directions by teleporting data from one is- 
land to the next. This interconnect design is one of the key 
innovative features of QLA architecture, as it allows us to 
completely overlap communication and computation, thus 
eliminating communication latency at the application level 
of the program. 

Figure 1. (a) A simple schematic of the basic elements of a planar ion-trap for quantum computing. Ions are trapped
in any of the trapping regions shown and ballistically shuttled from one trapping region to another. When two ions are
together a two-qubit gate can be performed. (b) Our abstraction of the ion-trap layout. Each trapping region can hold up
to two ions for two-qubit gates. The trapping regions are interconnected with the crossing junctions which are treated as
a shared resource.

Anticipating technology improvements in the near future
we found that for performing large, relevant instances of
Shor's factoring algorithm, sufficient reliability is achieved
at level 2 encoding per logical qubit using the Steane
[[7,1,3]] error correction code [9]. In the QLA, computation
could occur at any logical qubit and each logical gate is
followed by an error correction procedure. To preserve ho- 
mogeneity and maximum flexibility for large-scale applica- 
tions each logical qubit was accompanied by all necessary 
error correction auxiliary qubit resources such that compu- 
tational speed was maximized. This amounted to a (1 : 2) 
ratio between logical data qubits and ancillary qubits. 



2.2 Low-Level Physical Architecture Model

At the lowest level our architecture design is based on the 
ion-trap technology for quantum computation. Initially
proposed by Cirac and Zoller in 1995 [10], the technology uses
a number of atomic ions that interact with lasers to quantum 
compute. Quantum data is stored in the internal electronic 
and nuclear states of the ions, while the traps themselves 
are segmented metal traps (or electrodes) that allow indi- 
vidual ion addressing. Two ions in neighboring traps can 
couple to each other forming a linear chain of ions whose 
vibrational modes provide qubit-qubit interaction used for 
multi-qubit quantum gates [11, 12]. Together with single bit
rotations this yields a universal set of quantum logic gates. 
All quantum logic is implemented by applying lasers on 
the target ions, including measurement of the quantum state 
[13, 14, 15, 16]. Sympathetic cooling ions absorb vibrations
from data ions, which are then dampened through laser
manipulation [17, 18]. Recent experiments [19, 20, 21] have
demonstrated all the necessary components needed to build
a large-scale ion-trap quantum information processor.
Finally, multiple ions in different traps can be controlled by
focusing lasers through MEMS mirror arrays [22].



Operation      Time (μs), now (future)    Failure Rate, now (future)
Single Gate    1 (1)                      10^-? (10^-?)
Double Gate    10 (10)                    0.03 (10^-?)
Measure        200 (10)                   0.01 (10^-?)
Movement       20 (10)                    0.005 (5 x 10^-?) per μm
Split          200 (0.1)
Cooling        200 (0.1)
Memory time    10 to 100 sec
Trap Size      ~200 (1-5) μm


Table 1. Column 1 gives estimates for execution 
times for basic physical operations used in the QLA 
model. Currently achieved component failure rates 
are based on experimental measurements at NIST 
with ^9Be^+ ions, and using ^24Mg^+ ions for sympathetic
cooling [17, 18]. All parameters are followed by their
projected parameters in parenthesis, extrapolated following
recent literature [23, 24, 25], and discussions
with the NIST researchers; these estimates are used 
in modeling the performance of our architecture. 

Figure 1 shows a schematic of the physical structure of
an ion trap computer element. In Figure 1(a) we see a single
ion trapped in the middle trapping region. Trapping regions 
are the locations where ions can be prepared for the execu- 
tion of a logical gate, which is implemented by an external 
laser source pulsed on the ions in the trap. In the figure we 
see an ion moving from the far right trapping region to the 
top-right for the execution of a two-bit logical operation. 

Figure 2. For a 64-qubit adder, the amount of parallelism
that can be extracted when resources are unlimited, and
when the number of gates per cycle is limited. This figure
shows that if 15 gates, or an unlimited number of gates,
could be performed in each cycle, the total runtime would
remain the same.

Figure 1(b) demonstrates our abstraction of the physical
ion-trap layout. The layout can be represented as a collection
of trapping regions connected together through shared
junctions. A fundamental time-step, or a clock cycle, in
an ion-trap computer will be defined as any physical,
unencoded logic operation (one-bit or two-bit), a basic move
operation from one trapping region to another, and measurement.
Table 1 summarizes current experimental parameters
and corresponding optimistic parameters for ion-traps.
In our subsequent analysis we will assume that each clock 
cycle for a fundamental time-step has a duration of 10 μs,
failure rates are 10^^ for single-qubit operations and mea-
surement, 10^^ for CNOT gates [25], and order of 10^^ per
fundamental move operation. The movement failure rate 
is expected to improve from what it is now as trap sizes 
shrink and electrode surface integrity continues to improve. 
We will assume trap sizes of 5 μm each [26], and on the
order of 10 electrodes per trapping region [27], which gives
us a trapping region dimension (including the junction) of
50 μm. The parameters chosen for our study are optimistic
compared to [28] and [29]. Both of those papers assume
more pessimistic near term parameters which are useful for 
building a 100 bit prototype, but probably not a scalable 
quantum computer that can factor 1024-bit numbers using 
Shor's algorithm. Based on the quantum computing ARDA 
roadmap [23], we feel justified in using aggressive
parameters when looking 10-15 years into the future.



3 Architectural Abstractions 

This section motivates the need for a compact architecture
for quantum processors and describes our design, the
CQLA (Compressed Quantum Logic Array). We discuss
how separation into memory and compute regions benefits 
the CQLA and then present our quantum memory hierarchy. 



3.1 Motivation 

Conventional quantum processor designs are based on 
the sea-of-qubits design and allow computation to take 
place anywhere in the processor. This design philosophy
follows the idea of maximum parallelism and is employed
in our previous work [1]. The area consumption of such a
design, however, is untenably large, about 1 m^2 to factor a
1024-bit number.

When we consider the amount of available parallelism in 
quantum applications, we discover that much is to be gained 
by limiting computation to a specifically designated loca- 
tion. The remaining area can be optimized for storage of 
quantum data. A good example for the benefit of specializa- 
tion in quantum applications is the Draper carry-lookahead 
quantum adder [30], which forms a basic component
of Shor's quantum factoring algorithm [31]. Figure 2 shows
that providing unlimited computational resources for a
64-bit adder does not offer a performance benefit over limiting
the computation to 15 locations. As described in Section 2,
each data location where computation is allowed requires
twice as many ancillary qubits as data qubits. In this
example, by providing only 15 compute locations instead of
64, we can reduce the area consumed by the adder by
approximately half and yet have no change in performance.
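
The "approximately half" figure can be checked with a back-of-the-envelope count. This is a sketch under stated assumptions, not the paper's area model: area is taken to be proportional to the number of logical qubits (data plus ancilla), the adder is treated as 64 data locations as in the example, compute locations carry two ancilla per data qubit, and the remaining locations use the (8 : 1) data-to-ancilla memory ratio given in Section 3.2. The function name `adder_area` is hypothetical.

```python
def adder_area(data_qubits: int, compute_locations: int,
               compute_ancilla_per_data: float = 2.0,
               memory_ancilla_per_data: float = 1.0 / 8.0) -> float:
    """Area in units of one logical qubit, counting data and ancilla."""
    memory_locations = data_qubits - compute_locations
    compute_area = compute_locations * (1.0 + compute_ancilla_per_data)
    memory_area = memory_locations * (1.0 + memory_ancilla_per_data)
    return compute_area + memory_area


full = adder_area(64, 64)      # every location can compute: 192.0 units
limited = adder_area(64, 15)   # 15 compute locations: 45 + 55.125 = 100.125 units
ratio = limited / full         # about 0.52

if __name__ == "__main__":
    print(full, limited, round(ratio, 2))
```

Under these assumptions the limited design needs about 52% of the original area, consistent with the estimate in the text.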

3.2 Specialized Components 

The fact that qubits in an ion-trap quantum processor
have long lifetimes when idle allows us to improve logical
qubit density in the memory. Qubits in memory can wait for
a longer time period between two consecutive error
corrections. We use this to significantly reduce the error correction
ancillary resources in memory, thereby increasing its density.
The majority of computation, on the other hand, is an in- 
teraction between two distinct logical qubits. To maintain 
adequate system fidelity, every gate must be followed by an 
error correction procedure. Consequently, a quantum pro- 
cessor spends most of its time performing error correction 
and the compute regions are designed to allow fast error 
correction by providing a greater number of ancilla in the 
logical qubits. Figure 3(a) shows a specialization into
compute and memory regions. The ratio of (data : ancilla) can be
seen to be (8 : 1) for memory and (1 : 2) for the compute
region.

While specialization helps address our primary goal of 
reducing size, it can possibly also reduce performance. In 
Section 5 we show how judiciously choosing the size of the
compute region helps maintain adequate performance while
simultaneously reducing size.

3.3 Quantum Memory Hierarchy 

Figure 3. (a) Memory is denser since it has fewer ancilla qubits. The figure shows 3 data qubits in the compute block
which take the same area as 8 data qubits in memory. In the CQLA each compute block holds nine data qubits and 18
ancilla. Both compute and memory are at level 2 encoding. (b) Memory is at level 2 encoding, while the compute and
cache are at level 1 encoding. The complete CQLA consists of memory at level 2, compute regions at level 2 and also
a cache and compute region at level 1.

Another important architectural design choice is the
effect of the error correction code chosen in both the
memory and the compute regions. Error correction is the most
dominant procedure and the resources used increase expo- 
nentially with each level of concatenation. In addition to 
resources, the time to error correct increases exponentially 
with each level of concatenation. The benefits of concate- 
nated error correction are that the reliability of each op- 
eration increases double exponentially, thus allowing far 
greater number of total operations to be performed. For any 
application, all logical qubits are not being acted upon by 
gates for the entire duration of the algorithm. In fact, just 
like classical computers, data locality is a common phe- 
nomenon. This implies that a logical qubit could start at 
level 2 encoding, be encoded at level 1 during the peak in 
its activity and return to level 2 when idle. 

We now introduce a quantum memory hierarchy, in
addition to the specialized design. Memory at level 2, which is
optimized for area and reliability, will be inherently slower
than a computational structure at level 1, optimized for gate
execution. This calls for a cache that can alleviate the
need for constant communication.



Figure 3(b) outlines this approach, showing the separation
between memory and compute regions. The cache and the
compute regions here are similar to Figure 3(a) in every way
save that they are at a lower level of encoding. In the
memory hierarchy, memory and cache have a similar design,
only memory is at a higher level of encoding, and hence is
slower and much more reliable. The critical feature here is
the transfer network, which is more complicated and hence
slower than the teleportation channels described above. The
transfer network comes into play only when we change the 
encoding of a logical qubit. For all other communication 



(within compute blocks, between cache and compute blocks 
and within memory) teleportation is still the chosen
mechanism. Section 4 describes how the transfer process is
performed in a fault-tolerant manner.

4 Error Correction and Code Transfer 

In this section we describe the cost of the error
correction circuits and code-transfer networks we use when a
specific physical layout is considered. Section 2.2 describes in
detail our technology parameters, which we find to be
necessary for such a large-scale architecture. These parameters
allow the large scalability to be achieved because the
physical component failure rates are below the threshold value
needed for efficient error correction [32].



4.1 Error Correction Codes 

Some of the best error correction codes (ECC) are ones 
that use very few physical qubits, and allow "easy" fault- 
tolerant gate implementations. A requirement of a fault- 
tolerant system is that computation proceeds without decod- 
ing the encoded data. Thus logical gates are implemented 
directly on encoded qubits, ensuring that errors introduced 
during the gate can be corrected. Many code choices for 
EC allow transversal logical gate implementation, which 
means that the same physical gate acts on each lower-level 
qubit. 

Each logical quantum gate is preceded and followed by 
an error correction procedure. The EC procedure works 
by encoding ancillary qubits in the logical "0" state of the 
data and interacting the data and the ancilla. The interaction 
causes errors in the data to propagate to the ancilla and to 



be detected when the ancilla is measured. There are several 
very important logical gates that we must consider during 
error correction. The bit-flip gate, X, flips the value of the
qubit by swapping the amplitudes of its "0" component
and its "1" component. The phase-flip gate, Z, acts
only on the qubit's "1" component by changing its sign. The
most important gate is the controlled-X gate (denoted as the
CNOT gate), which flips the state of the target qubit whenever
the state of the control qubit is set. Errors on the data can
be understood as the product of phase-flips and bit-flips.
A syndrome is extracted for each type of error. We only
present the cost of error correction networks and details
relevant to building a large-scale architecture. The interested
reader can refer to the literature for additional theoretical
information [33].
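
The action of the three gates named above can be written out directly on amplitude lists. This is a minimal illustrative sketch on unencoded qubits, not the encoded (fault-tolerant) implementation the paper analyzes; the function names are hypothetical, and the two-qubit basis ordering |00>, |01>, |10>, |11> with the control listed first is an assumption of the example.

```python
from typing import List

Amplitudes = List[complex]


def apply_x(state: Amplitudes) -> Amplitudes:
    """Bit-flip: swap the amplitudes of the "0" and "1" components."""
    return [state[1], state[0]]


def apply_z(state: Amplitudes) -> Amplitudes:
    """Phase-flip: change the sign of the "1" component only."""
    return [state[0], -state[1]]


def apply_cnot(state: Amplitudes) -> Amplitudes:
    """Controlled-X on two qubits ordered |00>, |01>, |10>, |11> (control first).

    The target is flipped only when the control is set, so the |10> and |11>
    amplitudes are exchanged.
    """
    return [state[0], state[1], state[3], state[2]]


if __name__ == "__main__":
    print(apply_x([1, 0]))           # "0" becomes "1"
    print(apply_z([0.6, 0.8]))       # sign of the "1" amplitude flips
    print(apply_cnot([0, 0, 1, 0]))  # control set: target flipped
```

Applying `apply_x` and `apply_z` in either order gives results that differ only by an overall sign, which is why a general error can be tracked as a product of a bit-flip and a phase-flip.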
Figure 4. A high-level view of an error correction
sequence. Two syndromes for bit-flip and phase-flip
errors are extracted.



Architecture Metric               Error Code - Level           Value
EC Time (seconds)                 [[7,1,3]] - L1               3.1 x 10^-3
                                  [[7,1,3]] - L2               0.3
                                  [[9,1,3]] - L1               1.2 x 10^-3
                                  [[9,1,3]] - L2               0.1
Qubit Size (mm^2)                 [[7,1,3]] - L1               0.2
                                  [[7,1,3]] - L2               3.4
                                  [[9,1,3]] - L1               0.1
                                  [[9,1,3]] - L2               2.4
Transversal Gate Time (seconds)   [[7,1,3]] - L1               6.2 x 10^-3
                                  [[7,1,3]] - L2               0.5
                                  [[9,1,3]] - L1               2.4 x 10^-3
                                  [[9,1,3]] - L2               0.2
Size, number of logical qubits    [[7,1,3]] - L1               7
                                  [[7,1,3]] - L1 (ancilla)     21
                                  [[7,1,3]] - L2               49
                                  [[7,1,3]] - L2 (ancilla)     441
                                  [[9,1,3]] - L1               9
                                  [[9,1,3]] - L1 (ancilla)     12
                                  [[9,1,3]] - L2               81
                                  [[9,1,3]] - L2 (ancilla)     298




Table 2. Error Correction Metric Summary. Given 
the fact that we use optimistic ion-trap parameters all 
numbers are estimates and are thus rounded to only 
one significant digit. 



Figure 4 is a simple schematic of the general error
correction procedure, where time flows from left to right and
each line represents the evolution of an encoded logical
qubit. An error correction code is labeled by [[n,k,d]],
encoding k logical qubits into n qubits and correcting
(d - 1)/2 errors. If our target reliability is such that we require
L levels of recursion, each line in Figure 4 represents n^L
level zero qubits. For the bit-flip error syndrome the
ancilla are encoded into the logical (0 + 1) state and interact
with the data through a transversal CNOT gate, which is
essentially n level (L - 1) transversal CNOT gates of which
the ancillary qubits are targets. Each
of the lower level CNOT gates is followed by a lower level
error correction unless the lower level is zero. In our
architecture analysis we provide information about two error
correcting codes: the Steane [[7,1,3]] code [9], and an
improved version of the Shor [[9,1,3]] code [34] denoted as the
Bacon-Shor code [4, 5, 6].
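
The [[n,k,d]] bookkeeping in this paragraph is easy to make concrete. The sketch below is illustrative only: the `Code` class and its method names are hypothetical, and it covers just the two quantities defined above, the number of correctable errors (d - 1)/2 and the n^L level zero qubits behind one encoded line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Code:
    """An [[n, k, d]] quantum error correcting code."""
    n: int  # qubits per block
    k: int  # logical qubits per block
    d: int  # code distance

    def correctable_errors(self) -> int:
        """Number of arbitrary errors the code corrects, (d - 1) / 2."""
        return (self.d - 1) // 2

    def level_zero_qubits(self, level: int) -> int:
        """Level zero data qubits in one logical qubit at the given recursion level."""
        return self.n ** level


steane = Code(n=7, k=1, d=3)
bacon_shor = Code(n=9, k=1, d=3)

if __name__ == "__main__":
    for code in (steane, bacon_shor):
        print(code, code.correctable_errors(),
              [code.level_zero_qubits(level) for level in (1, 2)])
```

For levels 1 and 2 this gives 7 and 49 qubits for the [[7,1,3]] code and 9 and 81 for the [[9,1,3]] code, matching the non-ancilla size entries in Table 2.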

The Steane [[7,1,3]] Code encodes 1 qubit into 7 qubits, 
and is the smallest error correction code allowing transver- 
sal gate implementation for all gates involved in concate- 
nated error correction algorithms. The addition of the T 
phase gate, which is harder to implement, provides univer- 
sal quantum logic using the [[7, 1,3]] error correcting code. 
For this reason it was used as the underlying error correcting 
code in the analysis of the QLA architecture [1]. It consists
of 7 data ions which encode our logical level 1 qubit with 14
ancillary ions used for error correction, seven of which are
used in the error correction and the other seven verify the ancilla.



Considering communication, the level 1 error correction
circuit will take 154 cycles, where each cycle is on the
order of 10 microseconds, and can take as long as 0.003 seconds
per error correction procedure at level 1. A level 2 [[7,1,3]] qubit
will be composed of 7 level 1 data qubits and 7 level 1
ancilla qubits; there is no need for verification ancilla at L = 2.
The size of a level 2 qubit will be 3.4 mm^2, and a fully
serialized error correction will last approximately 0.3 seconds
(this is two orders of magnitude more than the time to error
correct at level 1).
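
The timing figures in this paragraph can be related with simple arithmetic. The sketch below is an estimate under assumptions that the text does not state explicitly: that the 154-cycle circuit corresponds to one syndrome extraction, and that a full error correction needs two such passes (one for bit-flips, one for phase-flips). The names are hypothetical.

```python
CYCLE_SECONDS = 10e-6  # 10 microseconds per fundamental time-step


def cycles_to_seconds(cycles: int, cycle_seconds: float = CYCLE_SECONDS) -> float:
    """Convert a cycle count into wall-clock seconds."""
    return cycles * cycle_seconds


one_pass = cycles_to_seconds(154)  # 1.54e-3 seconds
two_pass = 2 * one_pass            # 3.08e-3 seconds (assumed: two syndrome passes)
level2_ratio = 0.3 / two_pass      # level 2 time relative to level 1, roughly 100

if __name__ == "__main__":
    print(one_pass, two_pass, round(level2_ratio))
```

Under the two-pass assumption the level 1 estimate is about 3 milliseconds, in line with the 0.003 second figure, and the level 2 time of 0.3 seconds is about two orders of magnitude larger.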

Bacon-Shor [[9,1,3]] Code: The [[9,1,3]] code was the first
error correcting code to be discovered for arbitrary errors
[34]. Recent observations make this code faster and
spatially smaller than the [[7,1,3]] code [4, 5, 6]. The compact
structure of the physical layout for the [[9,1,3]] code
significantly improves communication requirements. At level
1 the error correction time is only 0.001 seconds, and it is 0.1
seconds at level 2. The level 2 qubit size is approximately
2.4 mm^2. Table 2 summarizes the error correction codes we have
used and their parameters for some useful architecture
metrics.



4.2 Code Transfer Networks: Overview 

One of the most interesting components of the memory
hierarchy is the code transfer region. This region
transfers data encoded in code C1 to a second code C2
without the need to decode. Figure 5 illustrates this concept.
The transfer network teleports the data in C1 to C2, where
C1 and C2 may be any two error correcting codes. The
code teleportation procedure works much the same way
as standard data teleportation that is used for
communication. A correlated ancillary pair is prepared first between
C1 and C2 through the use of a multi-qubit cat-state (i.e.
"(00...0 + 11...1)"). The data qubit interacts with the
equivalently encoded ancillary qubit through a CNOT gate, and
the two are measured. Following the measurement the state
of the data is recreated at the C2 encoded ancillary qubit.
This process is required every time we transfer a qubit from
memory to the cache or vice versa. Table 3 summarizes the
times for different code transfer combinations between
levels 1 and 2 for the [[7,1,3]] and the [[9,1,3]] codes.

(seconds)    7-L1    7-L2    9-L1    9-L2
7-L1         -       0.6     0.02    0.2
7-L2         1.3     -       1.3     1.5
9-L1         0.01    0.5     -       0.1
9-L2         0.4     0.9     0.4     -

Table 3. Transfer network latency for a combination
of the [[7,1,3]] and [[9,1,3]] codes.

Figure 5. Code Teleportation Network from Code 1
(C1) to Code 2 (C2). C1 and C2 can even be the same
error correcting code, but different levels of encoding. 
The solid triangles denote an error correction step. 



5 CQLA Analysis and Results 

This section provides analysis of the abstractions
presented in Section 3 when used
to perform quantum modular exponentiation.



5.1 Specialization into Memory 

We now analyze our design, the CQLA, when it
separates the quantum processor into memory and compute
regions. High density in memory is achieved by greatly
increasing the ratio of logical data qubits to logical ancilla
qubits, which is (8 : 1) in memory and (1 : 2) in the
compute regions. This greatly reduces overall area since prior
work had a ratio of (1 : 2) throughout the architecture. Thus
the memory is denser, but slower, which is permissible due
to the large memory wait times.

Quantum modular exponentiation is the most time
consuming part of Shor's algorithm, and the Draper carry-
lookahead adder is its most efficient implementation. This
adder comprises single qubit gates, two qubit CNOT gates and
three qubit Toffoli gates, and is dominated by Toffoli gates.
The time to perform a single fault-tolerant Toffoli is equal
to the time for fifteen two qubit gates, each of which is
followed by an error-correction step. Table 4 shows the
savings that can be achieved when using denser memory. Note
that performance is minimally impacted for the Steane Code
as we exploit the limited parallelism in the adder. We
address the parallelism available within the application itself
and determine how the number of compute blocks needed to
maximally exploit this parallelism changes with problem size.
Figure 6(a) shows how, for a fixed problem size, utilization of
each compute block decreases with an increase in the
number of compute blocks. Clearly, the decrease in utilization
is offset by the increase in overall performance. Thus the
challenge here is to find the balance between utilization and
performance.
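
The trade-off between utilization and runtime can be illustrated with a toy scheduler. This is a minimal sketch and not the scheduler or the adder circuit used in the paper: it assumes unit-time gates, one gate per compute block per cycle, a greedy issue order, and a small hypothetical seven-gate dependency graph. The function name `schedule` and the gate labels are made up for the example.

```python
from typing import Dict, List, Tuple


def schedule(deps: Dict[str, List[str]], blocks: int) -> Tuple[int, float]:
    """Greedy list scheduling of unit-time gates on `blocks` compute blocks.

    `deps` maps every gate to the gates it depends on.
    Returns (cycles, utilization), where utilization is the fraction of
    block-cycles that actually executed a gate.
    """
    done = set()
    remaining = set(deps)
    cycles = 0
    while remaining:
        # A gate is ready once all of its predecessors finished in earlier cycles.
        ready = sorted(g for g in remaining if all(p in done for p in deps[g]))
        if not ready:
            raise ValueError("dependency cycle detected")
        issued = ready[:blocks]
        done.update(issued)
        remaining.difference_update(issued)
        cycles += 1
    return cycles, len(deps) / (cycles * blocks)


# Hypothetical reduction tree: four independent gates feed two, which feed one.
TOY_DEPS = {
    "g0": [], "g1": [], "g2": [], "g3": [],
    "h0": ["g0", "g1"], "h1": ["g2", "g3"],
    "r": ["h0", "h1"],
}

if __name__ == "__main__":
    for blocks in (1, 2, 4, 8):
        print(blocks, schedule(TOY_DEPS, blocks))
```

For this toy graph the runtime falls from 7 cycles with one block to 3 cycles with four blocks and does not improve further with eight, while utilization drops from 1.0 to about 0.29, the same qualitative behavior the text describes.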

We compare all our results to [1], which used only
the Steane ECC. Since the Bacon-Shor ECC uses fewer
overall resources (Table 2) and allows faster error-correction, a
design based on these codes not only is much smaller,
but is also faster. The CQLA thus reduces the area
required by a factor of 9 with minimal performance
reduction for the Steane ECC and by a factor of 13
with a speedup of 2 when using the Bacon-Shor ECC.
To compare the relative merit of our design choices, we
use the gain product, which can be defined by
GP = (Area_old * AdderTime_old) / (Area_CQLA * AdderTime_CQLA),
where AdderTime is the average time per adder for
modular exponentiation. The gain product indicates the
improvement in system parameters relative to our prior
work, the QLA. The higher the gain product, the better the
collective improvement in area and time of our system.
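
The gain product is a ratio of area-time products, so it can equivalently be computed as the area reduction factor multiplied by the speedup. The following minimal sketch shows both forms; the function names are hypothetical, and the sample values are the 1024-bit, 100-compute-block Bacon-Shor entries from Table 4.

```python
def gain_product(area_old: float, time_old: float,
                 area_cqla: float, time_cqla: float) -> float:
    """GP = (Area_old * AdderTime_old) / (Area_CQLA * AdderTime_CQLA)."""
    return (area_old * time_old) / (area_cqla * time_cqla)


def gain_from_factors(area_reduction: float, speedup: float) -> float:
    """Same quantity from the area reduction factor and the speedup."""
    return area_reduction * speedup


if __name__ == "__main__":
    # Table 4, 1024-bit input, 100 compute blocks, Bacon-Shor code:
    # area reduced by 13.4, speedup 2.19, reported gain product 29.35.
    print(gain_from_factors(13.4, 2.19))
```

The product of the two tabulated factors agrees with the reported gain product to within the rounding of the table entries.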

Communication Issues: Toffoli gates cannot be directly
implemented on encoded data and have to be broken down into
multiple two-qubit gates. Performing a fault-tolerant
Toffoli between three logical qubits requires extra logical
ancilla and logical cat-state qubits. The flow of data
between these nine qubits to complete a single Toffoli
forms the most intense communication pattern during the
entire addition operation. To study the bandwidth
requirements during the Toffoli gates, we developed a
scheduler that would try to have all the requirements for
communication (creating EPR pairs, transporting and purifying them)


Input      Compute   Area Reduced (Factor of)   SpeedUp                Gain Product
Size       Blocks    St-Code     BSr-Code       St-Code    BSr-Code    St-Code    BSr-Code

32-bit       4       6.69        9.80           0.54       1.47        3.61       14.41
             9       3.22        4.74           0.97       2.9         3.14       13.74

64-bit       9       6.36        9.32           0.70       1.92        4.45       17.70
            16       3.79        5.56           0.98       3.0         3.71       16.68

128-bit     16       7.24        10.6           0.72       1.97        5.24       20.88
            25       4.90        7.17           0.96       2.84        4.70       20.36

256-bit     36       6.65        9.47           0.92       2.51        6.12       23.68
            49       5.07        7.43           0.98       2.98        4.96       22.14

512-bit     64       7.42        10.87          0.92       2.50        6.80       27.18
            81       6.06        8.87           0.98       2.91        5.94       25.81

1024-bit   100       9.14        13.4           0.80       2.19        7.35       29.35
           121       7.81        11.45          0.97       2.65        7.60       30.34

Table 4. For various input sizes, this table shows how the CQLA performs for Modular Exponentiation. The space
saved due to compressing the memory blocks and separating memory and compute regions is shown as compared to
prior work [1]. St-Code is the Steane ECC and BSr-Code is the Bacon-Shor code. The Gain Product is compared with
our prior work, the QLA, which has a Gain Product of 1.0.



in place while the logical qubit to be transported was
undergoing error correction after completion of the
previous gate. With a bandwidth of one channel, it was
possible to overlap communication with computation for the
Steane [[7,1,3]] code. To enable this overlap when using
the Bacon-Shor code, the required bandwidth was three
channels. Table 2 shows that while a logical qubit encoded
in the Bacon-Shor code is smaller when ancilla are
considered, it has more data qubits than the Steane code.
Since only data qubits are involved during teleportation,
the time for teleporting a logical qubit in the Bacon-Shor
code is greater. In addition, the Bacon-Shor codes take far
fewer error-correction cycles. These two factors push its
bandwidth requirement higher. Note that the higher
bandwidth is accounted for in the results of Table 4.



Superblocks: In the CQLA, several compute blocks together
form compute superblocks. This is done to exploit the
locality inherent to an application. Having larger
superblocks also increases the perimeter bandwidth between
the compute and memory regions of the CQLA. This increase
in the bandwidth of a larger superblock is offset by the
much greater increase in communication required. Our
intuition tells us that at a certain point, it may be more
efficient to have multiple small superblocks instead of one
large superblock. To determine this number concretely, we
plot the change in bandwidth required against the change in
bandwidth available. Figure 6(b) shows that the cross-over
point is 36 compute blocks per superblock, regardless of
which error-correction code is used. Thereafter it is no
longer beneficial to increase the size of an individual
compute superblock.



5.2 Memory Hierarchy 

Reducing the encoding level of the compute region will
dramatically increase its speed. Recall that resources,
time and reliability all increase exponentially as we
increase the level of encoding. With the compute region at
level 1 and memory at level 2, the challenge is the very
familiar one of the CPU being an order of magnitude faster
than the memory. To maximize the benefit of a much faster
compute region, we introduce the quantum memory hierarchy.
In our hierarchy, the memory is at level 2 encoding (slow
and reliable), the cache is at level 1 (faster, less
reliable) and the compute region is also at level 1
(fastest, with the same reliability as the cache). The
difference in speed between the compute region and the
cache is due to the greater number of ancilla in the
compute region.

To study the behavior of the CQLA with a cache and 
multiple encoding levels, we developed a simulator that 
models a cache. The simulator takes into account the com- 
putation cost in both encoding levels and also the cost of 
transferring logical qubits between encoding levels. The 
application under consideration is still the Draper carry- 
lookahead adder. Input to the simulator is a sequence of in- 
structions; each instruction is similar to assembly language 
and describes a logical gate between qubits. We have writ- 
ten generators that output this code in a form that can take 
advantage of an architecture with maximal parallelism. 

When the simulator runs this code in the sequence in- 
tended by the Draper carry-lookahead adder, the cache hit- 
rate is limited to 20%. To improve the hit-rate, we utilize 
the following optimized approach. Since we are schedul- 
ing statically, the instruction fetch window for the simulator 

Figure 6. (a) Change in utilization as the number of compute blocks increases. (b) The point of intersection of the two
bottom curves is the optimal size of a compute superblock. These two curves are the bandwidth required (at the perimeter
of the compute superblock) in modular exponentiation and the bandwidth available. The third, steep curve is the worst-case
bandwidth required.

Figure 7. Cache hit-rate for different adders when both the cache and the compute region are at
Level 1 recursion. The largest cache considered holds twice the number of logical qubits as the compute
block. Results for both the non-optimized version and the optimized version are shown.



can be the whole program. The simulator takes advantage
of this by first creating a dependency list of all input
instructions. Then it carefully selects the next
instruction such that the probability of finding all
required operands in the cache is maximized. This optimized
fetch yields a cache hit-rate of almost 85% regardless of
adder size and cache size. The replacement policy in the
cache is least recently used. Figure 7 shows the cache
hit-rates for different sized adders for the non-optimized
and optimized instruction fetch approaches. If n is the
number of logical qubits in the compute region, the cache
sizes we studied were n, 1.5n and 2n. As the graph shows,
the increase in hit-rate due to the optimized fetch is more
pronounced than that due to increasing the cache size. For
the CQLA, we thus employ a cache size of twice the number
of qubits in the compute region. The high hit-rate means
the transfer networks will not be overwhelmed.
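
The optimized fetch described above can be sketched as a greedy list scheduler over an LRU cache. The code below is a minimal illustration and not the simulator used for the reported results: the names `build_dependencies` and `run` are hypothetical, instructions are reduced to tuples of the logical qubits they touch, and "maximizing the probability of finding all operands in the cache" is approximated by picking the ready instruction with the most operands currently cached. It also assumes the cache holds at least as many qubits as one instruction uses.

```python
from collections import OrderedDict


def build_dependencies(program):
    """For each instruction, the set of earlier instructions it must wait for
    (the most recent earlier instruction touching each of its qubits)."""
    last_use = {}
    deps = []
    for idx, qubits in enumerate(program):
        deps.append({last_use[q] for q in qubits if q in last_use})
        for q in qubits:
            last_use[q] = idx
    return deps


def run(program, cache_size, optimized=True):
    """Execute the program and return the operand hit-rate of an LRU cache."""
    deps = build_dependencies(program)
    done, pending = set(), list(range(len(program)))
    cache = OrderedDict()  # keys are cached qubits, ordered oldest-first
    hits = accesses = 0
    while pending:
        ready = [i for i in pending if deps[i] <= done]
        if optimized:
            # Greedy choice: most operands already cached; ties go to the
            # earliest instruction in program order.
            nxt = max(ready,
                      key=lambda i: (sum(q in cache for q in program[i]), -i))
        else:
            nxt = ready[0]  # plain program order
        for q in program[nxt]:
            accesses += 1
            if q in cache:
                hits += 1
                cache.move_to_end(q)           # mark as most recently used
            else:
                if len(cache) >= cache_size:
                    cache.popitem(last=False)  # evict least recently used
                cache[q] = True
        pending.remove(nxt)
        done.add(nxt)
    return hits / accesses


# Toy program (not the Draper adder): two independent pairs, each used twice.
toy = [("a", "b"), ("c", "d"), ("a", "b"), ("c", "d")]
print(run(toy, cache_size=2, optimized=False), run(toy, cache_size=2))
```

On this toy input the in-order schedule thrashes a two-qubit cache, while the greedy schedule reuses each pair before moving on. The numbers are not comparable to the 20% and 85% hit-rates quoted above, which come from the actual adder instruction streams.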

Fault-tolerance with multiple encoding levels: A quantum
computer running an application of size S = KQ, where K is
the number of time-steps and Q is the number of logical
qubits, will need to have a component failure rate of at
most $P_f = 1/KQ$. To evaluate the expected component
failure rate at some level of recursion we use Gottesman's
estimate for local architectures [35], shown in Equation 3
below.

The value for r is the communication distance between
level 1 blocks, which are aligned in the QLA to allow r = 12
cells on average, and L denotes the level of recursion. The



Code                       Par    Adder   L1        L2        Adder     Area      Gain
                           Xfer   Size    SpeedUp   SpeedUp   SpeedUp   Reduced   Product

Steane [[7,1,3]] Code      10     256     17.417    0.98      6.25      5.07      31.68
                                  512     17.41     0.97      6.33      6.06      38.38
                                  1024    18.18     0.88      4.93      9.14      45.06

                           5      256     10.409    0.98      4.05      5.07      24.99
                                  512     10.408    0.97      4.04      6.06      24.48
                                  1024    10.96     0.88      2.94      9.14      26.87

Bacon-Shor [[9,1,3]] Code  10     256     9.61      1.53      5.92      7.43      43.99
                                  512     9.61      2.28      8.82      8.87      78.23
                                  1024    10.15     2.00      8.10      13.4      108.53

                           5      256     5.17      1.53      3.66      7.43      27.19
                                  512     5.17      2.28      5.45      8.87      48.37
                                  1024    5.49      2.00      4.99      13.40     66.90

Table 5. This table shows the results of incorporating a memory hierarchy and two separate encoding levels. Depending
on the number of parallel transfers possible between memory and cache, we can expect different speedup values for
the adder at level 1. This, combined with the results from Table 4, gives us the final Gain Product. Comparatively, prior work
has a Gain Product of 1.0.



threshold failure rate, $p_{th}$, for the Steane [[7,1,3]]
circuit, accounting for movement and gates, was computed in
[36] to be approximately 7.5 x 10^^. Taking as $p_0$ the
average of the expected failure probabilities given in
Table 1 and using Equation 3, we find that for our system
to be reliable it can spend only 2% of the total execution
time in level 1. Recall that error correction is the most
frequently performed operation in the CQLA. For the Steane
code, level 2 error correction takes 0.3 sec and level 1
takes 3.1 x 10^-3 sec, which is approximately 1% of the
level 2 time. Thus if all operations performed by the CQLA
were equally divided between level 1 and level 2
operations, the system would maintain its fidelity. The
Bacon-Shor ECC can be analyzed in a similar manner, and its
results are more favourable due to a higher threshold.
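
The arithmetic behind this claim can be checked with a few lines of code. The sketch below is illustrative only: the function name `level1_time_fraction` is ours, and the level 1 error-correction time is taken as 3.1 x 10^-3 sec, the value consistent with the stated ratio of roughly 1% of the 0.3 sec level 2 time.

```python
def level1_time_fraction(n_level1, n_level2, t_level1=3.1e-3, t_level2=0.3):
    """Fraction of total error-correction time spent at level 1.

    n_level1, n_level2: number of operations performed at each level.
    t_level1, t_level2: Steane-code error-correction times in seconds
    (t_level1 is an assumed value, see the text above).
    """
    time_l1 = n_level1 * t_level1
    time_l2 = n_level2 * t_level2
    return time_l1 / (time_l1 + time_l2)


# Operations split equally between the two levels.
equal_split = level1_time_fraction(1, 1)
print(f"{equal_split:.2%} of the time is spent in level 1 (budget: 2%)")
```

With an equal split of operations, about 1% of the execution time falls in level 1, which is inside the 2% budget. Shifting more operations to level 2 lowers the fraction further.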

The CQLA architecture now consists of a memory at level 2,
a compute region also at level 2, a cache and a compute
region at level 1, and transfer networks for changing the
qubit encoding levels. Since quantum modular exponentiation
is performed by repeated quantum additions, we could
perform half of these additions completely in level 2 and
the other half in level 1. To comfortably maintain the
fidelity of the system, we perform one level 1 addition for
every two level 2 additions. The resulting increase in
performance is shown in Table 5.



6 Application Behavior 

In this section, compute and memory are at level 2
encoding. Contrary to traditional silicon-based processors,
in the CQLA a single communication step does not take
longer than the computation of a single gate. The reason
behind this phenomenon is the lack of reliability of
quantum data, which forces us to perform an
error-correction procedure after each gate. The time to
complete a fault-tolerant Toffoli is about 20 times greater
than that of a two-qubit CNOT gate. The applications we
study are modular exponentiation and the quantum Fourier
transform.

6.1 Shor's Algorithm 

Shor's algorithm is the most celebrated of quantum
algorithms due to its potential exponential advantage over
conventional algorithms and its application to breaking
public-key cryptography [31]. Shor's algorithm is primarily
composed of two parts, the modular exponentiation and the
quantum Fourier transform.

Modular Exponentiation: The execution of modular ex- 
ponentiation is dominated by Toffoli gates. To keep the 
compute block from having to wait for qubits, and hence 
stalling, the bandwidth around the perimeter of the com- 
pute block has to accommodate the transfer of three qubits 
to and from memory. Intuitively, since the CQLA is a mesh, 
and the bottleneck in bandwidth will be at the edge of the 
compute blocks, having adequate bandwidth at this edge is 
sufficient for the rest of the mesh. 

Based on the communication results from [1], we calculate
that two channels on the perimeter of the compute block
would provide adequate bandwidth for all required
communication. We compute the time required for all
communication steps and compare it against the total
computation time for differently sized adders. The result
is shown in Figure 8(a) and demonstrates that communication
requirements do not adversely impact the design.

Figure 8. Total communication and computation times for the two components of Shor's algorithm: (a) Modular
Exponentiation, (b) Quantum Fourier Transform (QFT). Although communication is significant in the QFT, modular
exponentiation dominates Shor's algorithm. Both of these results are for the Bacon-Shor code.

Quantum Fourier Transform: While the Quantum Fourier
Transform (QFT) comprises a small fraction of the overall
Shor's algorithm, it requires all-to-all personalized
communication between data qubits. In addition, it uses
only one-qubit and two-qubit gates, which consume much less
time. As a result, studying the performance of the QFT
gives us an insight into how the CQLA will behave when
faced with a communication-heavy and computation-light
application.

In the worst case, all nine data qubits (maximum capac- 
ity of the compute block) would have to be transferred to or 
from memory simultaneously. 

Between compute blocks, the QFT's all-to-all personalized
communication must be supported on the CQLA mesh network.
We leverage the vast amount of prior work done in studying
mesh networks, and employ a near-optimal algorithm proposed
in [37]. The total time for communication for varying
problem sizes is shown in Figure 8(b). Note that while
communication time is a little less, it closely tracks the
computation time for all problem sizes. This is due to the
difference between the time to error correct a single
logical qubit and the time to transport a single qubit,
which stays constant regardless of the problem size.



7 Future Work 



A high-level goal of this work is to build abstractions
from which architects and systems designers can examine
open issues and help guide the substantial basic science and
engineering under investment towards building a scalable
quantum computer. The primary focus of our work has been
system balance. The driving force in this balance has been
application parallelism. A key open issue is the restruc- 
turing of quantum algorithms to manage this parallelism in 
the context of system balance. From an architectural point 
of view, the most relevant abstract properties are density of 
functional components, the memory hierarchy and commu- 
nication bandwidth. 

While our work has focused on trapped ions, most scal- 
able technologies will have a similar two-dimensional lay- 
out where our techniques can be easily applied. This is be- 
cause the density is determined by the ratio of data to ancilla 
rather than physical details of the underlying technology. 

For ion-traps, lasers can also be a control issue. We 
plan to study how our architecture can minimize the number 
of lasers and minimize the power consumed by each laser, 
since power is proportional to fanout. Efficiently routing 
control signals to all electrodes in an ion-trap is a challeng- 
ing proposition, one that has not yet been considered for 
large systems. Currently, we perform the whole adder at 
the fast level 1 encoding or at the level 2 encoding; clever 
instruction scheduling techniques can allow us to improve 
performance by reducing granularity. 



8 Conclusion 

The technologies and abstractions for quantum comput- 
ing have evolved to an exciting stage, where architects and 
system designers can attack open problems without intimate 
knowledge of the physics of quantum devices. We explore 
the amount of parallelism available in quantum algorithms 
and find that a specialized architecture can serve our needs 
very well. The CQLA design is an example where
architectural techniques of specialization and balanced
system design have led to up to a 13X improvement in
density and an 8X increase in performance, while preserving
fault tolerance. We hope that further application of compiler and
system optimizations will lead to even more dramatic gains 
towards a scalable, buildable quantum computer. 